@fawazshah fawazshah commented Apr 22, 2021

This PR adds data checkpointing and extra error handling, and improves code readability.

  • We now catch errors when calling newspaper.build
  • We increment a new variable error_count whenever we encounter an error while downloading or parsing an article, or a NoneType publish date. If error_count > 10 we skip to the next article. (Previously we skipped only after encountering 10 or more NoneType dates.)
  • We remove the unneeded count function parameter
  • We print the current news site's position out of the total number of sites being scraped (e.g. "NEWS SITE 3 OUT OF 99")
  • We now save scraped data to JSON after each news site is processed rather than at the very end of processing, meaning if the script gets interrupted any data collected so far is saved
  • We remove the default limit parameter in run so it doesn't override the user-inputted limit
